Optimal Multiple Intervals Discretization of Continuous Attributes for Supervised Learning

Authors

  • Djamel A. Zighed
  • Ricco Rakotomalala
  • Fabien Feschet
Abstract

5, av. Pierre Mendès-France, 69676 BRON CEDEX, FRANCE
{zighed,rakotoma,ffeschet}@univ-lyon2.fr

In this paper, we propose an extension of Fisher's algorithm to compute the optimal discretization of a continuous variable in the context of supervised learning. Our algorithm is highly efficient, since its running time depends only on the number of runs and not directly on the number of points in the sample data set. We provide an empirical comparison between the optimal algorithm and two hill-climbing heuristics. In the next section, we present a formulation of the discretization problem; we then describe an extension of Lechevallier's algorithm that finds the optimal discretization, stressing the use of runs instead of the points of the sample data set. After that, we introduce two hill-climbing strategies. Finally, we present experiments and empirical studies of the performance of the various strategies.

Introduction

Rule induction from examples, such as the well-known induction trees (Breiman et al. 1984), usually uses categorical variables. Hence, to manipulate continuous variables, it is necessary to transform them to be compatible with the learning strategy. The process of splitting the continuous domain of an attribute into a set of disjoint intervals is called discretization. In this paper, we focus on supervised learning, where we take into account a class Y(.) to predict.

Formulation

Let D_X be the domain of definition of a continuous attribute X(.). The discretization of X(.) consists in splitting D_X into k intervals I_j, j = 1, ..., k, with k ≥ 1. We write I_j = [d_{j-1}, d_j[, where the d_j are the discretization points.
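To make the runs argument concrete, here is a minimal Python sketch (our illustration, not the authors' implementation), assuming a misclassification-count criterion: consecutive sorted points sharing a class are collapsed into runs, and a Fisher-style dynamic program then places cut points only at run boundaries, so its cost grows with the number of runs r rather than the sample size n.

```python
def class_runs(points):
    """points: iterable of (x, y) pairs. Returns a list of [label, count]
    runs after sorting by x; an optimal cut never falls inside a run."""
    runs = []
    for _, y in sorted(points):
        if runs and runs[-1][0] == y:
            runs[-1][1] += 1
        else:
            runs.append([y, 1])
    return runs

def optimal_discretization(points, k):
    """Fisher-style dynamic program: split the runs into k contiguous
    intervals minimizing the total misclassification count. O(k * r^2)
    for r runs, independent of the raw number of points."""
    runs = class_runs(points)
    r = len(runs)
    labels = sorted({y for y, _ in runs})
    # prefix[c][i] = points of class c among the first i runs
    prefix = {c: [0] * (r + 1) for c in labels}
    total = [0] * (r + 1)
    for i, (y, n) in enumerate(runs, start=1):
        for c in labels:
            prefix[c][i] = prefix[c][i - 1] + (n if c == y else 0)
        total[i] = total[i - 1] + n

    def cost(i, j):  # impurity of one interval covering runs i..j (1-based)
        size = total[j] - total[i - 1]
        majority = max(prefix[c][j] - prefix[c][i - 1] for c in labels)
        return size - majority  # points outside the majority class

    INF = float("inf")
    best = [[INF] * (r + 1) for _ in range(k + 1)]
    best[0][0] = 0
    for m in range(1, k + 1):
        for j in range(m, r + 1):
            best[m][j] = min(best[m - 1][i - 1] + cost(i, j)
                             for i in range(m, j + 1))
    return best[k][r]

data = [(0.1, 'a'), (0.2, 'a'), (0.3, 'b'), (0.7, 'b'), (0.9, 'a')]
print(optimal_discretization(data, 2))  # -> 1 (one point misclassified)
```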


Similar articles

Discretization Algorithm that Uses Class-Attribute Interdependence Maximization

Most of the existing machine learning algorithms are able to extract knowledge from databases that store discrete attributes (features). If the attributes are continuous, the algorithms can be integrated with a discretization algorithm that transforms them into discrete attributes. The paper describes an algorithm, called CAIM (class-attribute interdependence maximization), for discretization o...
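For reference, a tiny sketch of the CAIM criterion as it is usually stated in the follow-up literature (the quanta-matrix layout and function name are our assumptions, not taken from this abstract):

```python
# Illustrative CAIM criterion over a "quanta matrix": one column of
# per-class counts per interval. Higher CAIM = stronger class-attribute
# interdependence for the same number of intervals.
def caim(quanta):
    n = len(quanta)  # number of intervals
    return sum(max(col) ** 2 / sum(col) for col in quanta) / n

# Example: two intervals, two classes, each interval nearly pure.
print(caim([[8, 2], [1, 9]]))  # -> 7.25
```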


Discretization oriented to Decision Rules Generation

Many supervised learning algorithms only work with spaces of discrete attributes. Some of the methods proposed in the literature focus on discretization oriented to the generation of decision rules. This work provides a new discretization algorithm called USD (Unparametrized Supervised Discretization), which transforms the infinite space of the values of the continuous attributes into a ...


Discretization of Continuous Attributes in Supervised Learning algorithms

We propose a new algorithm, called CILA, for discretization of continuous attributes. The CILA algorithm can be used with any class-labeled data. The tests performed using the CILA algorithm show that it generates discretization schemes with almost always the highest dependence between the class labels and the discrete intervals, and always with a significantly lower number of intervals, when comp...


Discretizing Continuous Attributes While Learning Bayesian Networks

We introduce a method for learning Bayesian networks that handles the discretization of continuous variables as an integral part of the learning process. The main ingredient in this method is a new metric based on the Minimal Description Length principle for choosing the threshold values for the discretization while learning the Bayesian network structure. This score balances the complexity of ...
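The trade-off this abstract describes can be illustrated generically (this is not the paper's actual metric, only the MDL shape it alludes to): description length is bits to identify the chosen thresholds plus bits to encode the data given the discretized model, so extra thresholds must earn back their encoding cost.

```python
# Generic MDL-style score for a threshold choice; purely illustrative.
# The paper's metric additionally accounts for the Bayesian network
# structure being learned.
from math import comb, log2

def mdl_bits(num_candidate_cuts, chosen_cuts, data_bits_given_model):
    model_bits = log2(comb(num_candidate_cuts, chosen_cuts))  # which cuts
    return model_bits + data_bits_given_model
```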


A Bayesian approach for supervised discretization

In supervised machine learning, some algorithms are restricted to discrete data and thus need to discretize continuous attributes. In this paper, we present a new discretization method called MODL, based on a Bayesian approach. The MODL method relies on a model space of discretizations and on a prior distribution defined on this model space. This allows the setting up of an evaluation criterion...
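As a concrete illustration, here is a sketch of the MODL cost in the form later published by Boullé (2006); treat the exact formula as our recollection of that paper rather than something stated in this abstract. Lower cost means a more probable discretization model.

```python
# MODL-style cost of a discretization into I intervals, N instances,
# J classes:  log N + log C(N+I-1, I-1)
#           + sum_i log C(n_i + J - 1, J - 1)   (per-interval priors)
#           + sum_i log( n_i! / prod_j n_ij! )  (likelihood term)
from math import lgamma, log

def log_binom(n, k):
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def modl_cost(intervals):
    """intervals: list of per-class count vectors, one per interval."""
    I = len(intervals)
    J = len(intervals[0])
    N = sum(sum(iv) for iv in intervals)
    cost = log(N) + log_binom(N + I - 1, I - 1)
    for iv in intervals:
        n_i = sum(iv)
        cost += log_binom(n_i + J - 1, J - 1)
        cost += lgamma(n_i + 1) - sum(lgamma(c + 1) for c in iv)
    return cost
```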




Publication year: 1997